Main Questions
Our main question:
How did opioid shipment rates differ between rural and urban regions over time around the US from 2006-2014?
Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
To cite this case study please use:
Wright, Carrie, Ontiveros, Michael, Jager, Leah, Taub, Margaret, and Hicks, Stephanie. (2020). Opioids in the United States (Version v1.0.0). https://opencasestudies.github.io/ocs-bp-opioid-rural-urban/ocs_pop.html
In this case study we will be examining the number of opioid pills (specifically oxycodone and hydrocodone, as they are the top two abused opioids) shipped to pharmacies and practitioners at the county level around the United States (US) from 2006 to 2014.
This data comes from the DEA Automated Reports and Consolidated Ordering System (ARCOS) and was released by the Washington Post after legal action by the owner of the Charleston Gazette-Mail in West Virginia and the Washington Post.
We will investigate how the number of shipped pills compares for rural and urban counties. This analysis will demonstrate how different regions of the country may have been more at risk for opioid addiction crises due to differing rates of opioid prescription (using the number of pills as a proxy for prescription rates). This will help inform students about how evidence-based intervention decisions are made in this area.
This case study is motivated by this article:
García, M. C. et al. Opioid Prescribing Rates in Nonmetropolitan and Metropolitan Counties Among Primary Care Providers Using an Electronic Health Record System — United States, 2014–2017. MMWR Morb. Mortal. Wkly. Rep. 68, 25–30 (2019). DOI: 10.15585/mmwr.mm6802a1
This article explores rates of opioid prescriptions in rural and urban communities in the United States using the Athenahealth electronic health record (EHR) system for 31,422 primary care providers from January 2014 to March 2017.
The main takeaways from this article were:
Among 70,237 fatal drug overdoses in 2017, prescription opioids were involved in 17,029 (24.2%).
The percentage of patients prescribed an opioid was higher in rural than in urban areas.
Higher opioid prescribing rates put patients at risk for addiction and overdose.
Indeed, this was confirmed by another article, which surveyed heroin users in the Survey of Key Informants' Patients (SKIP) Program and the Researchers and Participants Interacting Directly (RAPID) program:
Cicero, T. J., Ellis, M. S., Surratt, H. L. & Kurtz, S. P. The Changing Face of Heroin Use in the United States: A Retrospective Analysis of the Past 50 Years. JAMA Psychiatry 71, 821 (2014). DOI:10.1001/jamapsychiatry.2014.366
They found that:
Respondents who began using heroin in the 1960s were predominantly young men (82.8%; mean age, 16.5 years) whose first opioid of abuse was heroin (80%).
However, more recent users were older (mean age, 22.9 years) men and women living in less urban areas (75.2%) who were introduced to opioids through prescription drugs (75.0%).
Heroin use has changed from an inner-city, minority-centered problem to one that has a more widespread geographical distribution, involving primarily white men and women in their late 20s living outside of large urban areas.
A much greater percentage of heroin users completing the survey in the SKIP Program reported currently living in small urban or nonurban areas than in large urban areas (75.2% vs 24.8%) at the time of survey completion.
This survey used self-declared area of current residence (large urban, small urban, suburban, or rural).
Photo by James Yarema on Unsplash
Our main question:
How did opioid shipment rates differ between rural and urban regions over time around the US from 2006-2014?
In this case study, we will demonstrate how to obtain data from an Application Programming Interface (API), which is an interface that allows users to more easily interact with a database. We will also especially focus on using packages and functions from the tidyverse, such as dplyr and tidyr. The tidyverse is a collection of packages created by RStudio. While some students may be familiar with base R, these packages make data science in R more legible and intuitive.
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
Data science skills:
- Getting data from an API (httr and jsonlite)
- Handling NA values (tidyr)
- Data wrangling (dplyr)
- Creating formatted tables (formattable)
- Getting a sense of missing data (naniar)
- Data visualization (ggplot2, ggiraph)

Statistical concepts and methods:
We will begin by loading the packages that we will need:
library(readxl)
library(tibble)
library(httr)
library(jsonlite)
library(stringr)
library(magrittr)
library(dplyr)
library(tidyr)
library(naniar)
library(ggplot2)
library(forcats)
library(ggpol) # creates geom_boxjitter
library(ggiraph) # creates interactive plot for plots that are difficult to label because they have so many elements
library(ggpubr) # ggarrange
library(formattable)

Packages used in this case study:
| Package | Use in this case study |
|---|---|
| readxl | to import an Excel file |
| httr | to retrieve data from an API |
| tibble | to create tibbles (the tidyverse version of dataframes) |
| jsonlite | to parse json files |
| stringr | to manipulate character strings within the data (subset and detect parts of strings) |
| dplyr | to filter, subset, join, and modify and summarize the data |
| magrittr | to pipe sequential commands |
| tidyr | to change the shape or format of tibbles to wide and long |
| naniar | to get a sense of missing data |
| ggplot2 | to create plots |
| forcats | to reorder factor for plot |
| ggpol | to create plots that have jitter and half boxplots |
| ggiraph | to create interactive plots |
| formattable | to create a formatted table |
The first time we use a function, we will use :: to indicate which package it comes from. Unless function names overlap across packages, this is not necessary, but we will include it here to be informative about where the functions we use come from.
What exactly are opioids?
According to the DEA and the Alta Mira addiction treatment center:
The term opioids (also known as narcotics, which comes from the Greek word for "stupor") describes a class of drugs that contain opium (from the poppy plant, Papaver somniferum), are derived from opium, or contain a semi-synthetic or synthetic substitute for opium.
Photo by Ingo Doerrie on Unsplash
However, technically, opioids are substances that bind to the opioid receptors in the body, which are involved in the sensation of pain and the experience of reward. There are opioids that are naturally made by the human body, the most well known being the endorphins.
Opioid drugs tend to be addictive because they modulate the reward system. This is the part of the brain that reinforces behaviors (normally behaviors such as drinking water or eating food) by causing the experience of pleasure (through the release of a neurotransmitter called dopamine).
This same system can be activated by opioids that naturally occur in the body, as well as by opioid prescription drugs and other addictive drugs. Activation of this system by drug use leads to very high releases of dopamine and the sensation of pleasure, which ultimately reinforces use of that drug.
In general, opioid medications and drugs have been found to dull the senses, relieve pain, suppress cough, reduce respiration and heart rate, induce constipation, and induce feelings of euphoria. They have a high potential for abuse and addiction.
Drugs within this class include (with prescription drug brand names shown in parentheses):
Opium comes from the fluid (also called poppy tears) inside the seed capsules of the Papaver somniferum plant, which is then dried. This fluid contains morphine, codeine, and thebaine.
Opium has been used by humans since 5000 BCE and it has been used across the world. See here for an interesting overview of the history.
Opium-derived medications were historically used in the United States to treat a variety of ailments besides pain, including cholera, dysentery, tuberculosis, and mental illness.
Of note, they state that “from 1898 to 1910 heroin was marketed as a non-addictive morphine substitute and cough medicine for children”!
Here you can see a photo of a bottle of heroin:
Opioids have continued to be used in the treatment of pain.
The opioid epidemic began in the late 1990s.
According to the US department of health and human services (HHS):
In the late 1990s, pharmaceutical companies reassured the medical community that patients would not become addicted to opioid pain relievers and healthcare providers began to prescribe them at greater rates.
Increased prescription of opioid medications led to widespread misuse of both prescription and non-prescription opioids before it became clear that these medications could indeed be highly addictive.
In 2017 the HHS declared a public health emergency
See here for a timeline of the epidemic in the US and here for more details about the epidemic.
According to this article from the Morbidity and Mortality Weekly Report (MMWR) of the Centers for Disease Control and Prevention (CDC):
Drug overdose is the leading cause of unintentional injury-associated death in the United States.
You can see that the most recent overdose deaths were due to the use of synthetic opioids, whereas previous high levels of overdoses (until about 2015) were attributable to natural and semi-synthetic opioids (which is what we will look at in this case study).
They state that:
From 1999–2018, almost 450,000 people died from an overdose involving any opioid, including prescription and illicit opioids.
Importantly, rates appear to differ across states, according to this CDC report.
According to the motivating report for our case study:
Prescription rates are now declining; however, prescription of opioids was found to be higher in rural areas than in urban areas.
It is important to identify locations where people are particularly vulnerable in order to target interventions for the communities that need them the most.
There are some important considerations regarding this data analysis to keep in mind:
According to the [Washington Post data](https://www.washingtonpost.com/national/2019/07/18/how-download-use-dea-pain-pills-database/) about the DEA data:
“It’s important to remember that the number of pills in each county does not necessarily mean those pills went to people who live in that county. The data only shows us what pharmacies the pills are shipped to and nothing else.”
Furthermore, we will define counties as being rural or urban; however, there can be great variation within a county, and we used land area values from only 2010 even though these can fluctuate. Therefore the way we categorized counties should be seen as an approximation.
Finally, overdose deaths are often due to the use of multiple substances. Simply because a county received more pills does not mean that people in that county would experience more drug overdoses. It is also important to remember that prescription opioids only account for a portion of the drug overdose deaths reported in this time period. However, according to this article, 75% of heroin users surveyed were introduced to opioids through prescription drug use.
We will use data from two sources:
The US Census for land area of counties, to allow us to estimate county-level population density
The Washington Post data from the Drug Enforcement Administration (DEA) about opioid (oxycodone and hydrocodone) pill shipments to pharmacies and practitioners around the US at the county level
This dataset was released in July of 2019 and has been controversial as according to the Washington Post:
The disclosure is part of a civil action brought by 2,500 cities, towns, counties and tribal nations alleging that nearly two dozen drug companies conspired to saturate the nation with opioids.
See here for more details about how this database was released.
The Washington Post states that they:
.. cleaned the data to include only information on shipments of oxycodone and hydrocodone pills. We did not include data on 10 other opioids because they were shipped in much lower quantities…
It’s important to remember that the number of pills in each county does not necessarily mean those pills went to people who live in that county. The data only shows us what pharmacies the pills are shipped to and nothing else.
This data was part of the [Automated Reports and Consolidated Ordering System (ARCOS)](https://www.deadiversion.usdoj.gov/arcos/retail_drug_summary/) of the DEA in which:
manufacturers and distributors report their controlled substances transactions
Their website indicates that:
The Controlled Substances Act of 1970 created the requirement for Manufacturers and Distributors to report their controlled substances transactions to the Attorney General. The Attorney General delegates this authority to the Drug Enforcement Administration (DEA).
ARCOS is an automated, comprehensive drug reporting system which monitors the flow of DEA controlled substances from their point of manufacture through commercial distribution channels to point of sale or distribution at the dispensing/retail level - hospitals, retail pharmacies, practitioners, mid-level practitioners, and teaching institutions. Included in the list of controlled substance transactions tracked by ARCOS are the following: All Schedules I and II materials (manufacturers and distributors); Schedule III narcotic and gamma-hydroxybutyric acid (GHB) materials (manufacturers and distributors); and selected Schedule III and IV psychotropic drugs (manufacturers only).
The annual report about the data from 2019 can be found here.
As this is a very large dataset, the Washington Post created an application programming interface (API) to make it easier for users to access the data.
An API is a computational interface that simplifies interactions with a data or file system for a user. It is similar to a Graphical User Interface (GUI), yet it allows the user more flexibility and functionality.
This link takes you to the Washington Post ARCOS API.
There was also an R package on cran called arcos for interacting with the API, but this has been archived. This package is however still available here on Github.
See here for more information about how to get access the Washington Post DEA database.
We will need land area data for our calculations of population density.
We obtained county land area data from the US census Bureau at this link
This link explains how the data is formated.
We will use the read_excel() function of the readxl package to import the data. We will also convert the data into a tibble (the tidyverse version of a data frame) by using the as_tibble() function of the tibble package.
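A minimal import sketch (the file name `LND01.xls` and the object name `land_area` are assumptions; adjust them to your local download):

```r
library(readxl)
library(tibble)

# Import the census land area spreadsheet (file name assumed)
land_area <- readxl::read_excel("LND01.xls")

# Convert the data frame to a tibble
land_area <- tibble::as_tibble(land_area)

# Show the first 6 rows
head(land_area)
```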
We can take a look at the data using the base head() function, which will show the first 6 rows.
# A tibble: 6 x 34
Areaname STCOU LND010190F LND010190D LND010190N1 LND010190N2 LND010200F
<chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 UNITED … 00000 0 3787425. 0000 0000 0
2 ALABAMA 01000 0 52423. 0000 0000 0
3 Autauga… 01001 0 604. 0000 0000 0
4 Baldwin… 01003 0 2027. 0000 0000 0
5 Barbour… 01005 0 905. 0000 0000 0
6 Bibb, AL 01007 0 626. 0000 0000 0
# … with 27 more variables: LND010200D <dbl>, LND010200N1 <chr>,
# LND010200N2 <chr>, LND110180F <dbl>, LND110180D <dbl>, LND110180N1 <chr>,
# LND110180N2 <chr>, LND110190F <dbl>, LND110190D <dbl>, LND110190N1 <chr>,
# LND110190N2 <chr>, LND110200F <dbl>, LND110200D <dbl>, LND110200N1 <chr>,
# LND110200N2 <chr>, LND110210F <dbl>, LND110210D <dbl>, LND110210N1 <chr>,
# LND110210N2 <chr>, LND210190F <dbl>, LND210190D <dbl>, LND210190N1 <chr>,
# LND210190N2 <chr>, LND210200F <dbl>, LND210200D <dbl>, LND210200N1 <chr>,
# LND210200N2 <chr>
Looks good!
The httr package formats what are called “GET requests” so that they will work properly. This allows for the data to be retrieved from the API.
The jsonlite package allows you to convert the data from JSON (often used by APIs) to a different format that is easier to work with.
APIs typically require a password or key to gain access. Thus the httr package helps to authenticate your data request. Often these keys are something that you do not want to share, unless the API is public.
In our case the API is indeed public, and currently “uO4EK6I” is publicly published as a key to use on the github page for the arcos package. We will use that here to access the API.
We are interested in the county-level data - first let's get the population data. We can access it by clicking the GET button on the API page. This gives us the following output:
curl -X GET "https://arcos-api.ext.nile.works/v1/county_population?key=uO4EK6I" -H "accept: application/json"
We can copy the URL section "https://arcos-api.ext.nile.works/v1/county_population?key=uO4EK6I" and use it in the GET() function of the httr package:
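The request itself might look like this (the object name `countyjson` is taken from the response shown later; the key is the public one published for the arcos package):

```r
library(httr)

# Retrieve county-level population data from the Washington Post ARCOS API
countyjson <- httr::GET(
  url = "https://arcos-api.ext.nile.works/v1/county_population?key=uO4EK6I")

# Print the response summary (URL, date, status, content type, size)
countyjson
```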
If we needed to specify a username and password, we would do so using the authenticate() function of the httr package within the GET function. The authenticate() function takes user, password and type arguments.
Here is an example:
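A hedged sketch of what that could look like (the URL, `USERNAME`, and `PASSWORD` are placeholders; this particular API does not actually require credentials):

```r
library(httr)

# Hypothetical request with credentials supplied via authenticate()
response <- httr::GET(
  url = "https://some-protected-api.example.com/v1/data",
  httr::authenticate(user = "USERNAME", password = "PASSWORD", type = "basic"))
```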
The default type is "basic", which is typically what is needed.
Now that we have used the GET function, we have a JavaScript Object Notation (JSON) file of the data.
JSON files are lightweight, meaning that they don't take up much memory, and they are human-readable files that make transmitting data from websites easier.
Response [https://arcos-api.ext.nile.works/v1/county_population?key=uO4EK6I]
Date: 2020-09-22 18:29
Status: 200
Content-Type: application/json
Size: 5.75 MB
Here we can see that the object called countyjson is a JSON response object. You will also see that the Status is 200, which means that we were successful in retrieving the data from the API.
Now we can use the content() function of the httr package to extract the text from the file:
This will be a very large string at this point; we can take a look at part of it by using the str_sub() function of the stringr package. In this case we will only look at the first 400 characters.
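A sketch of these two steps (assuming the response object is named `countyjson`):

```r
library(httr)
library(stringr)

# Extract the response body as one long character string
county_pop_text <- httr::content(countyjson, "text")

# Peek at the first 400 characters of the JSON string
stringr::str_sub(county_pop_text, start = 1, end = 400)
```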
What is a string or a character?
Click here for an explanation about character strings if you are not yet familiar
There are several classes of data in R programming. Character is one of these classes. A character string is an individual data value made up of characters. This can be a paragraph, like the legend for the table, or it can be a single letter or number like the letter "a" or the number "3".
If data are of class character, then numeric values will not be processed like numeric values in a mathematical sense.
[1] "[{\"BUYER_COUNTY\":\"AUTAUGA\",\"BUYER_STATE\":\"AL\",\"countyfips\":\"01001\",\"STATE\":1,\"COUNTY\":1,\"county_name\":\"Autauga\",\"NAME\":\"Autauga County, Alabama\",\"variable\":\"B01003_001\",\"year\":2006,\"population\":51328},{\"BUYER_COUNTY\":\"BALDWIN\",\"BUYER_STATE\":\"AL\",\"countyfips\":\"01003\",\"STATE\":1,\"COUNTY\":3,\"county_name\":\"Baldwin\",\"NAME\":\"Baldwin County, Alabama\",\"variable\":\"B01003_001\",\"year\":2006,\"population\":168121"
Now, to get the data into a more readable format, we can use the fromJSON() function of the jsonlite package and again create a tibble of the data using as_tibble().
county_pop <- jsonlite::fromJSON(county_pop_text, flatten = TRUE)
county_pop <- as_tibble(county_pop)

We can use the glimpse() function and the distinct() function of the dplyr package to get a better sense of the data. The distinct() function allows us to take a look at the unique values of the year variable.
Rows: 28,265
Columns: 10
$ BUYER_COUNTY <chr> "AUTAUGA", "BALDWIN", "BARBOUR", "BIBB", "BLOUNT", "BULL…
$ BUYER_STATE <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "A…
$ countyfips <chr> "01001", "01003", "01005", "01007", "01009", "01011", "0…
$ STATE <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ COUNTY <int> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 3…
$ county_name <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "Bull…
$ NAME <chr> "Autauga County, Alabama", "Baldwin County, Alabama", "B…
$ variable <chr> "B01003_001", "B01003_001", "B01003_001", "B01003_001", …
$ year <int> 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 20…
$ population <int> 51328, 168121, 27861, 22099, 55485, 10776, 20815, 115388…
# A tibble: 9 x 1
year
<int>
1 2006
2 2007
3 2008
4 2009
5 2010
6 2011
7 2012
8 2013
9 2014
It looks like we have the full data from 2006-2014.
We are also interested in opioid pill shipment data at the county level.
Here is the result of the same steps using the API for the combined_county_annual data:
curl -X GET "https://arcos-api.ext.nile.works/v1/combined_county_annual?key=uO4EK6I" -H "accept: application/json"
Question Opportunity
See if you can import the data without looking at the code for the population data.
Click here to reveal the code.
county_annual_json<-httr::GET(url = "https://arcos-api.ext.nile.works/v1/combined_county_annual?key=uO4EK6I")
county_annual_json_text <-httr::content(county_annual_json, "text")
county_annual <- jsonlite::fromJSON(county_annual_json_text, flatten = TRUE)
annualDosage <- tibble::as_tibble(county_annual)

Now let's take a look at the data:
Rows: 27,758
Columns: 6
$ BUYER_COUNTY <chr> "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABB…
$ BUYER_STATE <chr> "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "L…
$ year <int> 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 20…
$ count <int> 877, 908, 871, 930, 1197, 1327, 1509, 1572, 1558, 5802, …
$ DOSAGE_UNIT <dbl> 363620, 402940, 424590, 467230, 539280, 566560, 589010, …
$ countyfips <chr> "45001", "45001", "45001", "45001", "45001", "45001", "4…
# A tibble: 9 x 1
year
<int>
1 2006
2 2007
3 2008
4 2009
5 2010
6 2011
7 2012
8 2013
9 2014
Looks like we have the same years of data.
Now let's take a deeper look at the data to see if we have any missing data using the naniar package.
We can use the vis_miss() function to create a plot of missing data.
Let’s start with the land area data.
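A sketch, assuming the land area data was imported into a tibble named `land_area` (the object name is not shown in this section):

```r
library(naniar)

# Plot the proportion of missing values in each column of the land area data
naniar::vis_miss(land_area)
```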
Looks like there is no missing data.
How about the population data:
We appear to be missing some values for the NAME and variable columns, but we don't intend to use these, so this should be OK. It is, however, a good idea to check these rows to see if anything strange is happening.
Let's use the filter() function of the dplyr package and the base is.na() function to see more about the rows that are missing these values.
We will also start using the %>% pipe of the magrittr package for our assignments.
Click here if you are unfamiliar with piping in R, which uses the %>% operator.
By piping we mean using the %>% pipe operator, which is available after loading the tidyverse or several of the packages within the tidyverse, like dplyr, because they load the magrittr package. This allows us to perform multiple sequential steps on one data input.
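The filtering step might look like this (a sketch; we filter on the NAME column, which is shown as missing in the rows of interest):

```r
library(dplyr)
library(magrittr)

# Rows of the population data where NAME is missing
county_pop %>%
  dplyr::filter(is.na(NAME))
```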
# A tibble: 15 x 10
BUYER_COUNTY BUYER_STATE countyfips STATE COUNTY county_name NAME variable
<chr> <chr> <chr> <int> <int> <chr> <chr> <chr>
1 PRINCE OF W… AK 02198 2 201 Prince of … <NA> <NA>
2 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
3 WRANGELL AK 02275 2 280 Wrangell <NA> <NA>
4 PRINCE OF W… AK 02198 2 201 Prince of … <NA> <NA>
5 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
6 WRANGELL AK 02275 2 280 Wrangell <NA> <NA>
7 PRINCE OF W… AK 02198 2 201 Prince of … <NA> <NA>
8 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
9 WRANGELL AK 02275 2 280 Wrangell <NA> <NA>
10 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
11 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
12 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
13 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
14 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
15 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
# … with 2 more variables: year <int>, population <int>
This looks ok. So let’s now move on to the DEA data.
Interesting, we appear to be missing countyfips codes for a small percentage of our annual data.
Let’s take a look at this data:
# A tibble: 760 x 6
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
<chr> <chr> <int> <int> <dbl> <chr>
1 ADJUNTAS PR 2006 147 102800 <NA>
2 ADJUNTAS PR 2007 153 104800 <NA>
3 ADJUNTAS PR 2008 153 45400 <NA>
4 ADJUNTAS PR 2009 184 54200 <NA>
5 ADJUNTAS PR 2010 190 56200 <NA>
6 ADJUNTAS PR 2011 186 65530 <NA>
7 ADJUNTAS PR 2012 138 57330 <NA>
8 ADJUNTAS PR 2013 138 65820 <NA>
9 ADJUNTAS PR 2014 90 59490 <NA>
10 AGUADA PR 2006 160 49200 <NA>
# … with 750 more rows
It looks like the missing data is data for Puerto Rico - it makes sense that it doesn’t have countyfips codes.
Let’s see if there is any data other than data for Puerto Rico that is also missing countyfips values. We can use the != operator which indicates not equal to.
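One way to sketch this check (annualDosage is the tibble created earlier):

```r
library(dplyr)

# Rows missing countyfips that are NOT from Puerto Rico
annualDosage %>%
  dplyr::filter(is.na(countyfips)) %>%
  dplyr::filter(BUYER_STATE != "PR")
```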
# A tibble: 74 x 6
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
<chr> <chr> <int> <int> <dbl> <chr>
1 GUAM GU 2006 319 265348 <NA>
2 GUAM GU 2007 330 275600 <NA>
3 GUAM GU 2008 313 286900 <NA>
4 GUAM GU 2009 390 355300 <NA>
5 GUAM GU 2010 510 413800 <NA>
6 GUAM GU 2011 559 475600 <NA>
7 GUAM GU 2012 616 564800 <NA>
8 GUAM GU 2013 728 623200 <NA>
9 GUAM GU 2014 712 558960 <NA>
10 MONTGOMERY AR 2006 469 175390 <NA>
11 MONTGOMERY AR 2007 597 241270 <NA>
12 MONTGOMERY AR 2008 561 251760 <NA>
13 MONTGOMERY AR 2009 554 244160 <NA>
14 MONTGOMERY AR 2010 449 247990 <NA>
15 MONTGOMERY AR 2011 560 313800 <NA>
16 MONTGOMERY AR 2012 696 339520 <NA>
17 MONTGOMERY AR 2013 703 382300 <NA>
18 MONTGOMERY AR 2014 491 396900 <NA>
19 NORTHERN MARIANA ISLANDS MP 2006 165 117850 <NA>
20 NORTHERN MARIANA ISLANDS MP 2007 157 117500 <NA>
21 NORTHERN MARIANA ISLANDS MP 2008 204 143000 <NA>
22 NORTHERN MARIANA ISLANDS MP 2009 269 186900 <NA>
23 NORTHERN MARIANA ISLANDS MP 2010 231 196360 <NA>
24 NORTHERN MARIANA ISLANDS MP 2011 264 208500 <NA>
25 NORTHERN MARIANA ISLANDS MP 2012 290 217400 <NA>
26 NORTHERN MARIANA ISLANDS MP 2013 258 231400 <NA>
27 NORTHERN MARIANA ISLANDS MP 2014 260 239200 <NA>
28 PALAU PW 2006 5 14000 <NA>
29 PALAU PW 2007 9 26600 <NA>
30 PALAU PW 2008 2 7500 <NA>
31 PALAU PW 2009 3 10000 <NA>
32 PALAU PW 2013 1 1000 <NA>
33 SAINT CROIX VI 2006 544 198800 <NA>
34 SAINT CROIX VI 2007 612 237120 <NA>
35 SAINT CROIX VI 2008 694 254020 <NA>
36 SAINT CROIX VI 2009 601 233860 <NA>
37 SAINT CROIX VI 2010 764 316260 <NA>
38 SAINT CROIX VI 2011 756 320850 <NA>
39 SAINT CROIX VI 2012 755 314690 <NA>
40 SAINT CROIX VI 2013 802 328410 <NA>
41 SAINT CROIX VI 2014 684 269040 <NA>
42 SAINT JOHN VI 2006 65 22200 <NA>
43 SAINT JOHN VI 2007 60 21800 <NA>
44 SAINT JOHN VI 2008 70 24700 <NA>
45 SAINT JOHN VI 2009 58 23100 <NA>
46 SAINT JOHN VI 2010 75 23500 <NA>
47 SAINT JOHN VI 2011 89 30200 <NA>
48 SAINT JOHN VI 2012 85 30200 <NA>
49 SAINT JOHN VI 2013 66 22000 <NA>
50 SAINT JOHN VI 2014 63 20400 <NA>
51 SAINT THOMAS VI 2006 628 219100 <NA>
52 SAINT THOMAS VI 2007 757 249480 <NA>
53 SAINT THOMAS VI 2008 815 294250 <NA>
54 SAINT THOMAS VI 2009 798 313200 <NA>
55 SAINT THOMAS VI 2010 802 318630 <NA>
56 SAINT THOMAS VI 2011 932 383350 <NA>
57 SAINT THOMAS VI 2012 939 373280 <NA>
58 SAINT THOMAS VI 2013 988 376400 <NA>
59 SAINT THOMAS VI 2014 1021 314440 <NA>
60 <NA> AE 2006 2 330 <NA>
61 <NA> CA 2006 47 12600 <NA>
62 <NA> CT 2006 305 78700 <NA>
63 <NA> CT 2007 112 30900 <NA>
64 <NA> CT 2008 48 15000 <NA>
65 <NA> FL 2006 9 900 <NA>
66 <NA> FL 2007 7 700 <NA>
67 <NA> GA 2006 114 51700 <NA>
68 <NA> IA 2006 7 2300 <NA>
69 <NA> IN 2006 292 39300 <NA>
70 <NA> MA 2006 247 114900 <NA>
71 <NA> NV 2006 380 173600 <NA>
72 <NA> NV 2007 447 200600 <NA>
73 <NA> NV 2008 5 2200 <NA>
74 <NA> OH 2006 23 5100 <NA>
It looks like there is also data for other territories in the dataset, as well as some counties with no name.
For some reason, the rows for Montgomery County, Arkansas are also missing a countyfips value.
# A tibble: 9 x 6
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
<chr> <chr> <int> <int> <dbl> <chr>
1 MONTGOMERY AR 2006 469 175390 <NA>
2 MONTGOMERY AR 2007 597 241270 <NA>
3 MONTGOMERY AR 2008 561 251760 <NA>
4 MONTGOMERY AR 2009 554 244160 <NA>
5 MONTGOMERY AR 2010 449 247990 <NA>
6 MONTGOMERY AR 2011 560 313800 <NA>
7 MONTGOMERY AR 2012 696 339520 <NA>
8 MONTGOMERY AR 2013 703 382300 <NA>
9 MONTGOMERY AR 2014 491 396900 <NA>
According to this website, the FIPS code is 05097.
We will update these values in the next section.
We want the LND110210D column which is the data from the year 2010.
The column name encodes the following:
- LND = Land Area
- 110 = square miles (the subgroup code of the group)
- 2 = the century
- 10 = 2010 (based on the century)
- D = Data

(The census documentation is vague about the subgroup code, so treat this breakdown as an approximation.)
Thus we can select just the county names, the county numeric codes, and the LND110210D column by using the select() function of the dplyr package.
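A sketch of the column selection, assuming the imported land area tibble is named `land_area`:

```r
library(dplyr)

# Keep only county name, county code, and 2010 land area (square miles)
land_area %>%
  dplyr::select(Areaname, STCOU, LND110210D)
```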
# A tibble: 3,198 x 3
Areaname STCOU LND110210D
<chr> <chr> <dbl>
1 UNITED STATES 00000 3531905.
2 ALABAMA 01000 50645.
3 Autauga, AL 01001 594.
4 Baldwin, AL 01003 1590.
5 Barbour, AL 01005 885.
6 Bibb, AL 01007 623.
7 Blount, AL 01009 645.
8 Bullock, AL 01011 623.
9 Butler, AL 01013 777.
10 Calhoun, AL 01015 606.
# … with 3,188 more rows
We will use the case_when() function of the dplyr package to recode the NA countyfips values for the MONTGOMERY county rows of AR to be 05097. First we need to specify these particular rows. Because Montgomery may be a county name in other states, we need to check that the BUYER_STATE is AR and the BUYER_COUNTY is MONTGOMERY. We will use the & operator to indicate that both conditions must be true. We will then recode the countyfips values for these rows to "05097" using the ~ symbol. All other values need to stay the same, so we use TRUE ~ to keep all the other countyfips values as they currently are; otherwise they would automatically become NA.
We are also going to use a special pipe operator from the magrittr package called the compound assignment pipe-operator or sometimes the double pipe operator.
This allows us to use the annualDosage as our input and reassign it at the end after all the subsequent steps have been performed, although in this case it is only one step.
annualDosage %<>%
mutate(countyfips = case_when(BUYER_STATE == "AR" &
BUYER_COUNTY == "MONTGOMERY" ~ as.character("05097"),
TRUE ~ countyfips))

Now we can check that we indeed fixed our data.
# A tibble: 0 x 6
# … with 6 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, year <int>,
# count <int>, DOSAGE_UNIT <dbl>, countyfips <chr>
# A tibble: 9 x 6
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
<chr> <chr> <int> <int> <dbl> <chr>
1 MONTGOMERY AR 2006 469 175390 05097
2 MONTGOMERY AR 2007 597 241270 05097
3 MONTGOMERY AR 2008 561 251760 05097
4 MONTGOMERY AR 2009 554 244160 05097
5 MONTGOMERY AR 2010 449 247990 05097
6 MONTGOMERY AR 2011 560 313800 05097
7 MONTGOMERY AR 2012 696 339520 05097
8 MONTGOMERY AR 2013 703 382300 05097
9 MONTGOMERY AR 2014 491 396900 05097
Great! We fixed it.
OK, we also had some rows that didn't have county names, either because they were simply missing or because the data was for US territories. We will remove the rows that don't have county names.
First let's take a look at them again.
# A tibble: 17 x 6
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
<chr> <chr> <int> <int> <dbl> <chr>
1 <NA> AE 2006 2 330 <NA>
2 <NA> CA 2006 47 12600 <NA>
3 <NA> CT 2006 305 78700 <NA>
4 <NA> CT 2007 112 30900 <NA>
5 <NA> CT 2008 48 15000 <NA>
6 <NA> FL 2006 9 900 <NA>
7 <NA> FL 2007 7 700 <NA>
8 <NA> GA 2006 114 51700 <NA>
9 <NA> IA 2006 7 2300 <NA>
10 <NA> IN 2006 292 39300 <NA>
11 <NA> MA 2006 247 114900 <NA>
12 <NA> NV 2006 380 173600 <NA>
13 <NA> NV 2007 447 200600 <NA>
14 <NA> NV 2008 5 2200 <NA>
15 <NA> OH 2006 23 5100 <NA>
16 <NA> PR 2006 10 17800 <NA>
17 <NA> PR 2007 2 1300 <NA>
We can filter out these rows by using the exclamation mark ! (the logical NOT operator) before the is.na() function.
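The filtering step just described can be sketched on a toy tibble (the toy values here are illustrative, not the case study data):

```r
library(dplyr)

# filter() keeps only the rows where the condition is TRUE;
# !is.na(BUYER_COUNTY) is TRUE exactly when the county name is present.
toy <- tibble(BUYER_COUNTY = c("MONTGOMERY", NA, "ACADIA"),
              DOSAGE_UNIT  = c(175390, 330, 1969720))

toy %>% filter(!is.na(BUYER_COUNTY))
# The MONTGOMERY and ACADIA rows are kept; the NA row is dropped.
```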
And now let’s check that these NA values are gone:
# A tibble: 0 x 6
# … with 6 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, year <int>,
# count <int>, DOSAGE_UNIT <dbl>, countyfips <chr>
Let's check whether our land area data has information for US territories. If not, we will remove the data for the territories from our annualDosage data (although the territories would be very interesting and important to investigate in their own right). We can use the str_detect() function of the stringr package, which contains many functions for finding patterns in character strings, to look for data from Puerto Rico.
The str_detect() function allows us to look for a particular pattern. The pattern does not have to match the full value; a partial match is enough. Thus we can look to see whether there are any PR strings within the values of the Areaname variable.
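As a quick illustration with made-up strings (not the case study data), str_detect() returns TRUE whenever the pattern occurs anywhere in the string, and the match is case sensitive:

```r
library(stringr)

# A partial match anywhere in the string is enough for str_detect().
areas <- c("Adjuntas, PR", "Baldwin, AL", "PRINCE GEORGES, MD")
str_detect(areas, "PR")
# TRUE FALSE TRUE -- note the third is TRUE because "PRINCE" contains "PR"
```

The third match is a reminder to inspect what a pattern actually catches before filtering on it.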
# A tibble: 0 x 3
# … with 3 variables: Areaname <chr>, STCOU <chr>, LND110210D <dbl>
Using a different abbreviation, you can see that this code works as intended:
# A tibble: 81 x 3
Areaname STCOU LND110210D
<chr> <chr> <dbl>
1 ARIZONA 04000 113594.
2 ARKANSAS 05000 52035.
3 Arkansas, AR 05001 989.
4 Ashley, AR 05003 925.
5 Baxter, AR 05005 554.
6 Benton, AR 05007 847.
7 Boone, AR 05009 590.
8 Bradley, AR 05011 649.
9 Calhoun, AR 05013 629.
10 Carroll, AR 05015 630.
# … with 71 more rows
OK, so it does not look like there is any territory land area data in this dataset. Thus we will also remove the territory data from the annualDosage and monthlyDosage tibbles.
Question Opportunity
Do you recall how to do this?
Click here to reveal the code.
Great! Now there is no missing data in our annual data.
Defining whether a region is rural or urban is actually quite complicated, as the overall population, the structure of our towns and cities, and the access between different locations all change over time. Please see this report from the US Census Bureau about the history of this definition.
According to several definitions, urban areas are often defined as those with greater than 50,000 people. However, there are also definitions of rural areas based on "population densities of less than 500 people per square mile and places with fewer than 2,500 people". Typically counties are made up of multiple areas.
The Census Bureau estimates rural and urban areas around the US relatively often. However, census collection of these measurements does not occur every year.
Thus we will define a county as rural or urban based on the population density using the USDA definition that we described above:
Ideally we would want the land area for each year, as these values do fluctuate a bit; however, the 2010 values should be a decent approximation, as 2010 is in the middle of our time span.
We will therefore calculate the density as the number of people per square mile by dividing the population values by the land area values. To do so, we first need to combine our county_area and our county_pop data together, making sure that the column containing the numeric county code has the same format and the same name in both of the tibbles that we wish to combine.
We can use the rename() function of the dplyr package to rename the STCOU column to be countyfips. The new name is always listed first before the old name with this function like so: rename(new_name = old_name).
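For example, on a toy tibble (assuming the same column names as in county_area):

```r
library(dplyr)

toy_area <- tibble(STCOU = "05097", LND110210D = 800)

# rename(new_name = old_name): STCOU becomes countyfips
toy_area %>% rename(countyfips = STCOU)
# The tibble now has columns countyfips and LND110210D
```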
We can use the mutate() function of the dplyr package to make the countyfips variable a factor in both tibbles.
What exactly is a factor?
Click here for an explanation of data classes in R
There are several classes of data in R programming. Character is one of these classes. A character string is an individual data value made up of characters. This can be a paragraph, like the legend for the table, or it can be a single letter or number like the letter "a" or the number "3".
If data are of class character, then numeric-looking values will not be processed as numeric values in a mathematical sense.
If you want your numeric values to be interpreted that way, they need to be converted to a numeric class. The options typically used are integer (which has no decimal place) and double precision (which has a decimal place).
A variable that is a factor has a set of particular values called levels. Even if these levels look numeric, they will be interpreted as levels, not as mathematical numbers. You can modify the order of these levels with the forcats package.
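A minimal base R illustration of how levels behave:

```r
# Even numeric-looking values are stored as levels in a factor.
yrs <- factor(c("2014", "2006", "2006"))

levels(yrs)      # "2006" "2014" -- the sorted unique values
as.integer(yrs)  # 2 1 1 -- the underlying level codes, not the years
```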
county_pop %<>%
mutate(countyfips = as.factor(countyfips))
county_area %<>%
mutate(countyfips = as.factor(countyfips))
Great! Now we are ready to combine our data together.
We can do so using one of the *_join() functions of the dplyr package.
There are several ways to join data using the dplyr package.
See here for more details about joining data.
Since the population data came from the census API, we likely have population information for each of the counties in our opioid pill shipment data. Since the land area data came from a different source, it may contain additional counties that are not in our population or drug shipment data. Thus we will use the left_join() function, where x in this case will be county_pop and y will be county_area. This will add the LND110210D (land area) values for all counties that match county_pop, based on the countyfips column that the tibbles have in common.
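The behavior of left_join() can be sketched with toy tibbles (the values are illustrative only):

```r
library(dplyr)

pop  <- tibble(countyfips = c("01001", "01003"),
               population = c(51328, 168121))
area <- tibble(countyfips = c("01001", "01003", "01005"),
               LND110210D = c(594.44, 1589.78, 884.88))

# All rows of the left tibble (pop) are kept and gain the matching
# LND110210D values; area's extra county (01005) does not appear.
left_join(pop, area, by = "countyfips")
```

In the case study itself the call would presumably be along the lines of left_join(county_pop, county_area, by = "countyfips").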
We are now ready to calculate the population density per square mile. We can create a new column with this data using the mutate() function and the / to divide the population value by the land area value (in square miles) for each county. Let’s also make the year variable a factor.
county_info %<>%
mutate(density = population/LND110210D,
year = as.factor(year))
glimpse(county_info)
Rows: 28,265
Columns: 13
$ BUYER_COUNTY <chr> "AUTAUGA", "BALDWIN", "BARBOUR", "BIBB", "BLOUNT", "BULL…
$ BUYER_STATE <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "A…
$ countyfips <fct> 01001, 01003, 01005, 01007, 01009, 01011, 01013, 01015, …
$ STATE <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ COUNTY <int> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 3…
$ county_name <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "Bull…
$ NAME <chr> "Autauga County, Alabama", "Baldwin County, Alabama", "B…
$ variable <chr> "B01003_001", "B01003_001", "B01003_001", "B01003_001", …
$ year <fct> 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 20…
$ population <int> 51328, 168121, 27861, 22099, 55485, 10776, 20815, 115388…
$ Areaname <chr> "Autauga, AL", "Baldwin, AL", "Barbour, AL", "Bibb, AL",…
$ LND110210D <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622.81, 776.83,…
$ density <dbl> 86.34681, 105.75111, 31.48563, 35.49584, 86.05261, 17.30…
Great, now we are ready to create a variable that classifies whether a county was rural or urban based on our definition of rural counties being those with fewer than 500 people per square mile, as well as those with fewer than 2,500 people. We will use the case_when() function of the dplyr package to classify the new rural_urban variable as either "Urban" or "Rural" based on evaluations of the density and population variables. If the density is greater than or equal to 500 people per square mile, then the county will be coded as "Urban"; alternatively, if the density is less than 500 people per square mile or the population is less than 2,500, then the county will be coded as "Rural". The | operator is used to indicate that either expression should result in coding the county as "Rural".
county_info %<>%
mutate(rural_urban = case_when(density >= 500 ~ "Urban",
density < 500 | population < 2500 ~ "Rural"))
We can use the count() function of the dplyr package to see how many of each this resulted in:
# A tibble: 2 x 2
rural_urban n
<chr> <int>
1 Rural 26065
2 Urban 2200
We will now combine the annualDosage data with the county_info tibble.
Question Opportunity
How might we do this?
Click here to reveal the code.
Rows: 27,007
Columns: 16
$ BUYER_COUNTY <chr> "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABB…
$ BUYER_STATE <chr> "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "L…
$ year <fct> 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 20…
$ count <int> 877, 908, 871, 930, 1197, 1327, 1509, 1572, 1558, 5802, …
$ DOSAGE_UNIT <dbl> 363620, 402940, 424590, 467230, 539280, 566560, 589010, …
$ countyfips <fct> 45001, 45001, 45001, 45001, 45001, 45001, 45001, 45001, …
$ STATE <int> 45, 45, 45, 45, 45, 45, 45, 45, 45, 22, 22, 22, 22, 22, …
$ COUNTY <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ county_name <chr> "Abbeville", "Abbeville", "Abbeville", "Abbeville", "Abb…
$ NAME <chr> "Abbeville County, South Carolina", "Abbeville County, S…
$ variable <chr> "B01003_001", "B01003_001", "B01003_001", "B01003_001", …
$ population <int> 25821, 25745, 25699, 25347, 25643, 25515, 25387, 25233, …
$ Areaname <chr> "Abbeville, SC", "Abbeville, SC", "Abbeville, SC", "Abbe…
$ LND110210D <dbl> 490.48, 490.48, 490.48, 490.48, 490.48, 490.48, 490.48, …
$ density <dbl> 52.64435, 52.48940, 52.39561, 51.67795, 52.28144, 52.020…
$ rural_urban <chr> "Rural", "Rural", "Rural", "Rural", "Rural", "Rural", "R…
Great, now we should have the data that we need for the case study.
Notice how there is a variable called DOSAGE_UNIT. This indicates the number of pills shipped to a pharmacy in this county that were either oxycodone or hydrocodone.
Let's do a check to see how complete our data is now that we have combined our county_info data with the monthlyDosage and annualDosage data. We will have NA values for any counties present in the DEA data but not in our land area data. We can use the vis_miss() function of the naniar package to create a plot that shows whether we have any missing data.
# A tibble: 27 x 16
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips STATE COUNTY
<chr> <chr> <fct> <int> <dbl> <fct> <int> <int>
1 MONTGOMERY AR 2006 469 175390 05097 NA NA
2 MONTGOMERY AR 2007 597 241270 05097 NA NA
3 MONTGOMERY AR 2008 561 251760 05097 NA NA
4 MONTGOMERY AR 2009 554 244160 05097 NA NA
5 MONTGOMERY AR 2010 449 247990 05097 NA NA
6 MONTGOMERY AR 2011 560 313800 05097 NA NA
7 MONTGOMERY AR 2012 696 339520 05097 NA NA
8 MONTGOMERY AR 2013 703 382300 05097 NA NA
9 MONTGOMERY AR 2014 491 396900 05097 NA NA
10 PRINCE OF W… AK 2006 190 62700 02201 NA NA
# … with 17 more rows, and 8 more variables: county_name <chr>, NAME <chr>,
# variable <chr>, population <int>, Areaname <chr>, LND110210D <dbl>,
# density <dbl>, rural_urban <chr>
There does not appear to be land area and/or population data for these counties.
# A tibble: 9 x 14
BUYER_COUNTY BUYER_STATE countyfips STATE COUNTY county_name NAME variable
<chr> <chr> <fct> <int> <int> <chr> <chr> <chr>
1 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
2 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
3 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
4 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
5 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
6 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
7 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
8 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
9 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
# … with 6 more variables: year <fct>, population <int>, Areaname <chr>,
# LND110210D <dbl>, density <dbl>, rural_urban <chr>
# A tibble: 0 x 14
# … with 14 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, countyfips <fct>,
# STATE <int>, COUNTY <int>, county_name <chr>, NAME <chr>, variable <chr>,
# year <fct>, population <int>, Areaname <chr>, LND110210D <dbl>,
# density <dbl>, rural_urban <chr>
# A tibble: 0 x 14
# … with 14 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, countyfips <fct>,
# STATE <int>, COUNTY <int>, county_name <chr>, NAME <chr>, variable <chr>,
# year <fct>, population <int>, Areaname <chr>, LND110210D <dbl>,
# density <dbl>, rural_urban <chr>
# A tibble: 0 x 14
# … with 14 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, countyfips <fct>,
# STATE <int>, COUNTY <int>, county_name <chr>, NAME <chr>, variable <chr>,
# year <fct>, population <int>, Areaname <chr>, LND110210D <dbl>,
# density <dbl>, rural_urban <chr>
# A tibble: 1 x 34
Areaname STCOU LND010190F LND010190D LND010190N1 LND010190N2 LND010200F
<chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 Montgom… 05097 0 800. 0000 0000 0
# … with 27 more variables: LND010200D <dbl>, LND010200N1 <chr>,
# LND010200N2 <chr>, LND110180F <dbl>, LND110180D <dbl>, LND110180N1 <chr>,
# LND110180N2 <chr>, LND110190F <dbl>, LND110190D <dbl>, LND110190N1 <chr>,
# LND110190N2 <chr>, LND110200F <dbl>, LND110200D <dbl>, LND110200N1 <chr>,
# LND110200N2 <chr>, LND110210F <dbl>, LND110210D <dbl>, LND110210N1 <chr>,
# LND110210N2 <chr>, LND210190F <dbl>, LND210190D <dbl>, LND210190N1 <chr>,
# LND210190N2 <chr>, LND210200F <dbl>, LND210200D <dbl>, LND210200N1 <chr>,
# LND210200N2 <chr>
# A tibble: 0 x 34
# … with 34 variables: Areaname <chr>, STCOU <chr>, LND010190F <dbl>,
# LND010190D <dbl>, LND010190N1 <chr>, LND010190N2 <chr>, LND010200F <dbl>,
# LND010200D <dbl>, LND010200N1 <chr>, LND010200N2 <chr>, LND110180F <dbl>,
# LND110180D <dbl>, LND110180N1 <chr>, LND110180N2 <chr>, LND110190F <dbl>,
# LND110190D <dbl>, LND110190N1 <chr>, LND110190N2 <chr>, LND110200F <dbl>,
# LND110200D <dbl>, LND110200N1 <chr>, LND110200N2 <chr>, LND110210F <dbl>,
# LND110210D <dbl>, LND110210N1 <chr>, LND110210N2 <chr>, LND210190F <dbl>,
# LND210190D <dbl>, LND210190N1 <chr>, LND210190N2 <chr>, LND210200F <dbl>,
# LND210200D <dbl>, LND210200N1 <chr>, LND210200N2 <chr>
# A tibble: 0 x 34
# … with 34 variables: Areaname <chr>, STCOU <chr>, LND010190F <dbl>,
# LND010190D <dbl>, LND010190N1 <chr>, LND010190N2 <chr>, LND010200F <dbl>,
# LND010200D <dbl>, LND010200N1 <chr>, LND010200N2 <chr>, LND110180F <dbl>,
# LND110180D <dbl>, LND110180N1 <chr>, LND110180N2 <chr>, LND110190F <dbl>,
# LND110190D <dbl>, LND110190N1 <chr>, LND110190N2 <chr>, LND110200F <dbl>,
# LND110200D <dbl>, LND110200N1 <chr>, LND110200N2 <chr>, LND110210F <dbl>,
# LND110210D <dbl>, LND110210N1 <chr>, LND110210N2 <chr>, LND210190F <dbl>,
# LND210190D <dbl>, LND210190N1 <chr>, LND210190N2 <chr>, LND210200F <dbl>,
# LND210200D <dbl>, LND210200N1 <chr>, LND210200N2 <chr>
# A tibble: 0 x 10
# … with 10 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, countyfips <fct>,
# STATE <int>, COUNTY <int>, county_name <chr>, NAME <chr>, variable <chr>,
# year <int>, population <int>
# A tibble: 0 x 10
# … with 10 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, countyfips <fct>,
# STATE <int>, COUNTY <int>, county_name <chr>, NAME <chr>, variable <chr>,
# year <int>, population <int>
# A tibble: 0 x 10
# … with 10 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, countyfips <fct>,
# STATE <int>, COUNTY <int>, county_name <chr>, NAME <chr>, variable <chr>,
# year <int>, population <int>
We will now remove these rows before further analysis:
Question Opportunity
Do you recall how you would do this?
Click here to reveal the code.
Nice! Now we have no missing data.
Let's also check whether there were any counties in county_info that were not in the DEA annualDosage data.
# A tibble: 1,285 x 16
BUYER_COUNTY BUYER_STATE countyfips STATE COUNTY county_name NAME variable
<chr> <chr> <fct> <int> <int> <chr> <chr> <chr>
1 BRISTOL BAY AK 02060 2 60 Bristol Bay Bris… B01003_…
2 DILLINGHAM AK 02070 2 70 Dillingham Dill… B01003_…
3 LAKE AND PE… AK 02164 2 164 Lake and P… Lake… B01003_…
4 NOME AK 02180 2 180 Nome Nome… B01003_…
5 PRINCE OF W… AK 02198 2 201 Prince of … <NA> <NA>
6 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
7 SOUTHEAST F… AK 02240 2 240 Southeast … Sout… B01003_…
8 WADE HAMPTON AK 02270 2 270 Wade Hampt… Wade… B01003_…
9 WRANGELL AK 02275 2 280 Wrangell <NA> <NA>
10 YAKUTAT AK 02282 2 282 Yakutat Yaku… B01003_…
# … with 1,275 more rows, and 8 more variables: year <fct>, population <int>,
# Areaname <chr>, LND110210D <dbl>, density <dbl>, rural_urban <chr>,
# count <int>, DOSAGE_UNIT <dbl>
# A tibble: 174 x 1
countyfips
<fct>
1 02060
2 02070
3 02164
4 02180
5 02198
6 02232
7 02240
8 02270
9 02275
10 02282
# … with 164 more rows
# A tibble: 0 x 6
# … with 6 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, year <fct>,
# count <int>, DOSAGE_UNIT <dbl>, countyfips <fct>
# A tibble: 0 x 6
# … with 6 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, year <fct>,
# count <int>, DOSAGE_UNIT <dbl>, countyfips <fct>
There are 174 counties that don't have any data in the DEA data. It is unclear why these counties are not included. A Google search of the Borden and Coke counties in Texas does not indicate anything unusual about these counties in terms of when they were established or whether they later became part of other counties. It is important to keep in mind as we continue to analyze this data that the ARCOS data from the DEA released by the Washington Post does not include pill shipment information for all US counties.
We will begin by taking a deeper look at our data with some visualizations. We will use the ggplot2 package to create these visualizations.
Click here for an introduction about this package if you are new to using
ggplot2
The ggplot2 package is generally intuitive for beginners because it is based on a grammar of graphics or the gg in ggplot2. The idea is that you can construct many sentences by learning just a few nouns, adjectives, and verbs. There are specific “words” that we will need to learn and once we do, you will be able to create (or “write”) hundreds of different plots.
The critical part of making graphics using ggplot2 is that the data needs to be in a tidy format. Given that we have just spent time putting our data in tidy format, we are primed to take advantage of all that ggplot2 has to offer!
We will show how it is easy to pipe tidy data (output) as input to other functions that create plots. This all works because we are working within the tidyverse.
What is the ggplot() function? As explained by Hadley Wickham:
The grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinates system.
ggplot2 Terminology:
ggplot() - the base plot object that stores the data and the global plot attributes (e.g. the x and y variable names)
geoms - the geometric object layers that draw the data, e.g. geom_point(), geom_bar(), geom_line(), geom_histogram()
aes() - an aesthetic mapping function inside the ggplot() object. We use this function to specify plot attributes (e.g. x and y variable names) that will not change as we add more layers.
Anything that goes in the ggplot() object becomes a global setting. From there, we use the geom objects to add more layers to the base ggplot() object. These will define what we are interested in illustrating using the data.
Let’s make a plot to see how population density has changed over time in each state.
To do so we want to calculate a mean population density (across all the counties) for each state for each year.
We can do this using the group_by() and summarize() functions of the dplyr package. The group_by() function allows the data to be arranged into groups for subsequent operations.
Thus, if we group only by state using the following code, you will see that this results in 51 groups (one for each state, including Washington DC). This doesn't change anything about the data itself (or even how it is printed, aside from the groups noted above the table), just how it will be handled in subsequent steps.
# A tibble: 26,980 x 16
# Groups: BUYER_STATE [51]
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips STATE COUNTY
<chr> <chr> <fct> <int> <dbl> <fct> <int> <int>
1 ABBEVILLE SC 2006 877 363620 45001 45 1
2 ABBEVILLE SC 2007 908 402940 45001 45 1
3 ABBEVILLE SC 2008 871 424590 45001 45 1
4 ABBEVILLE SC 2009 930 467230 45001 45 1
5 ABBEVILLE SC 2010 1197 539280 45001 45 1
6 ABBEVILLE SC 2011 1327 566560 45001 45 1
7 ABBEVILLE SC 2012 1509 589010 45001 45 1
8 ABBEVILLE SC 2013 1572 596420 45001 45 1
9 ABBEVILLE SC 2014 1558 641350 45001 45 1
10 ACADIA LA 2006 5802 1969720 22001 22 1
# … with 26,970 more rows, and 8 more variables: county_name <chr>, NAME <chr>,
# variable <chr>, population <int>, Areaname <chr>, LND110210D <dbl>,
# density <dbl>, rural_urban <chr>
Alternatively, if we group by year this results in 9 groups of data, one for each year.
# A tibble: 26,980 x 16
# Groups: year [9]
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips STATE COUNTY
<chr> <chr> <fct> <int> <dbl> <fct> <int> <int>
1 ABBEVILLE SC 2006 877 363620 45001 45 1
2 ABBEVILLE SC 2007 908 402940 45001 45 1
3 ABBEVILLE SC 2008 871 424590 45001 45 1
4 ABBEVILLE SC 2009 930 467230 45001 45 1
5 ABBEVILLE SC 2010 1197 539280 45001 45 1
6 ABBEVILLE SC 2011 1327 566560 45001 45 1
7 ABBEVILLE SC 2012 1509 589010 45001 45 1
8 ABBEVILLE SC 2013 1572 596420 45001 45 1
9 ABBEVILLE SC 2014 1558 641350 45001 45 1
10 ACADIA LA 2006 5802 1969720 22001 22 1
# … with 26,970 more rows, and 8 more variables: county_name <chr>, NAME <chr>,
# variable <chr>, population <int>, Areaname <chr>, LND110210D <dbl>,
# density <dbl>, rural_urban <chr>
We want to group by both BUYER_STATE and year, so that we get the mean across all the counties for each state for each year. If we grouped only by state, we would get only 51 summarized results, one for each state, representing a mean across all the years.
# A tibble: 26,980 x 16
# Groups: BUYER_STATE, year [459]
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips STATE COUNTY
<chr> <chr> <fct> <int> <dbl> <fct> <int> <int>
1 ABBEVILLE SC 2006 877 363620 45001 45 1
2 ABBEVILLE SC 2007 908 402940 45001 45 1
3 ABBEVILLE SC 2008 871 424590 45001 45 1
4 ABBEVILLE SC 2009 930 467230 45001 45 1
5 ABBEVILLE SC 2010 1197 539280 45001 45 1
6 ABBEVILLE SC 2011 1327 566560 45001 45 1
7 ABBEVILLE SC 2012 1509 589010 45001 45 1
8 ABBEVILLE SC 2013 1572 596420 45001 45 1
9 ABBEVILLE SC 2014 1558 641350 45001 45 1
10 ACADIA LA 2006 5802 1969720 22001 22 1
# … with 26,970 more rows, and 8 more variables: county_name <chr>, NAME <chr>,
# variable <chr>, population <int>, Areaname <chr>, LND110210D <dbl>,
# density <dbl>, rural_urban <chr>
We can see that this results in 459 groups. This makes sense because 51 groups over 9 years is 51 multiplied by 9, which equals 459.
We can then use the summarize() function to create a new variable called mean_DENS, which will be equal to the mean of the density variable across all the counties within each of the 459 groups. If we had missing values, we would need to use the na.rm = TRUE argument to remove them from the calculation.
# A tibble: 459 x 3
# Groups: BUYER_STATE [51]
BUYER_STATE year mean_DENS
<chr> <fct> <dbl>
1 AK 2006 11.6
2 AK 2007 12.2
3 AK 2008 13.9
4 AK 2009 13.0
5 AK 2010 13.2
6 AK 2011 14.2
7 AK 2012 13.5
8 AK 2013 14.6
9 AK 2014 14.7
10 AL 2006 87.3
# … with 449 more rows
OK! Now we are ready to make our first plot.
We will start with the ggplot() function to specify which variables will be used for the x-axis and y-axis, as well as whether any variable should be used to specify different colors on the plot. This will result in a blank plot. Then we need to use a geom_* function to specify what type of plot we would like to make.
If you type geom_ into the console of RStudio, you will see a list of options.
We will create a scatter plot using the geom_point() function. We will also use the theme_minimal() function to change the overall aesthetics of the plot. See here for a list of options.
We will also use the theme() function to further specify how we want the plot to be displayed. We would like the x-axis text to be angled by 90 degrees. We can use the element_text() function to change aspects of the text, and the axis.text.x argument to specify that we specifically want to change the text of the x axis. You can type theme( in the RStudio console and press tab to see a list of argument options for things that you can change in your plot.
Finally, we can use the labs() function of the ggplot2 package to specify the labels of the plot.
Annual %>% group_by(BUYER_STATE, year) %>%
summarize(mean_DENS = mean(density, na.rm = TRUE)) %>%
ggplot(aes(x = BUYER_STATE, y = mean_DENS, col = year)) +
geom_point() +
theme_minimal()+
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90)) +
labs(x = "State",
title = "Mean County Population Density of each State",
y = "Mean Population Density (People per square mile)")
We can see that the average state population density is fairly similar for most states. However, DC, MA, NJ, NY, RI, and VA have much higher average county densities. We also see that DC shows the largest change over time, as we can distinguish the individual points for each year. For the other states the change was so small that the points overlap.
What about overall population density, how did the national average of all US counties change?
We will ignore the different states in this case and we will calculate the mean of all US counties for each year.
Question Opportunity
How might you create this plot?
Click here to reveal the code.
USavg_dens <- Annual %>% group_by(year) %>%
summarize(mean_DENS = mean(density)) %>%
ggplot(aes(x = year, y = mean_DENS)) +
geom_point() +
theme_minimal() +
theme(axis.title.x=element_blank()) +
labs(title = "US mean population density",
x = "year",
y = "population density (people per square mile)")
In this case we saved the plot to an object called USavg_dens.
Overall the density has increased; if you take a look at the y-axis, you can see that the density changed by about 13 people per square mile from 2006 to 2014.
How does this compare with raw population values?
Annual %>%
group_by(year) %>%
summarise(total_population = sum(population)) %>%
ggplot(aes(x =year, y = total_population)) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(axis.title.x = element_blank()) +
labs(title = "US Population from 2006-2014",
x = "year",
y = "US total population")
How has the number of rural and urban counties changed over the years?
To determine how the number of each type of county has changed over time, we will use the count() function of the dplyr package after grouping by the year variable to count the number of occurrences of the unique values (Rural and Urban) in the rural_urban variable.
# A tibble: 18 x 3
# Groups: year [9]
year rural_urban n
<fct> <chr> <int>
1 2006 Rural 2769
2 2006 Urban 235
3 2007 Rural 2760
4 2007 Urban 238
5 2008 Rural 2753
6 2008 Urban 238
7 2009 Rural 2756
8 2009 Urban 236
9 2010 Rural 2753
10 2010 Urban 238
11 2011 Rural 2756
12 2011 Urban 243
13 2012 Rural 2761
14 2012 Urban 243
15 2013 Rural 2754
16 2013 Urban 246
17 2014 Rural 2753
18 2014 Urban 248
In this case we can make a plot using two different geom_* layers together. Whatever geom_* layer is added last will be displayed on top. In this case we will use geom_point() and geom_smooth() to add a line connecting the points of the scatter plot of the geom_point() function.
Notice that these counts look quite different for county_info and Annual: it appears that many of the US counties that are not represented in the DEA data were rural.
# A tibble: 174 x 3
# Groups: countyfips [174]
countyfips rural_urban n
<fct> <chr> <int>
1 02013 Rural 7
2 02050 Rural 1
3 02060 Rural 9
4 02068 Rural 6
5 02070 Rural 9
6 02105 Rural 9
7 02164 Rural 9
8 02180 Rural 7
9 02185 Rural 1
10 02188 Rural 6
# … with 164 more rows
Annual %>% group_by( year) %>%
count(rural_urban) %>%
ggplot(aes(x = year, y = n, col = rural_urban, group = rural_urban)) +
geom_point() +
geom_smooth() +
facet_wrap(~ rural_urban, scales = "free") +
theme_minimal() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90),
legend.title = element_blank()) +
labs(y = "Number of Counties",
x = "Year",
title = "Change in the number of each type of county in the US over time")
As one might expect, it looks like the number of urban counties has increased, while the number of rural counties has decreased over time.
Let's also create a table to look at the number of rural and urban counties over time. To do this we can use the formattable package. First we need to get the data into the format that we would like. We previously counted the number of Rural and Urban counties for each year. However, the data was presented in what is called long format. In this format, variables that could be presented as separate columns are condensed into fewer columns, while still maintaining only a single value per cell. The opposite of this format is called wide format, which therefore has more columns and fewer rows. This is best illustrated with an example.
In the following image you can see wide data on the left, with more columns and fewer rows, and long data on the right, where the month columns have been collapsed into two longer columns (one with the name of the month and one with the numeric value), resulting in fewer columns and more rows. While long format is very useful for creating plots with ggplot2, wide format is better suited to tables meant to be read quickly, which is our current goal.
Click here to see another example.
Here is an example of wide data about different measurements of a variety of Iris flower species.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.3 3.0 1.1 0.1 setosa
2 5.0 3.3 1.4 0.2 setosa
3 7.7 3.8 6.7 2.2 virginica
4 4.4 3.2 1.3 0.2 setosa
5 5.9 3.0 5.1 1.8 virginica
6 6.5 3.0 5.2 2.0 virginica
7 5.5 2.5 4.0 1.3 versicolor
8 5.5 2.6 4.4 1.2 versicolor
9 5.8 2.7 5.1 1.9 virginica
10 6.1 3.0 4.6 1.4 versicolor
OK, so currently we have 4 different columns of measurements of the flowers. Since all of these measurements are similar, one might collapse them into a new variable made up of the names of the four measurement variables and another containing the numeric values, like so:
# A tibble: 40 x 3
Species Measurement Value
<fct> <chr> <dbl>
1 setosa Sepal.Length 4.3
2 setosa Sepal.Width 3
3 setosa Petal.Length 1.1
4 setosa Petal.Width 0.1
5 setosa Sepal.Length 5
6 setosa Sepal.Width 3.3
7 setosa Petal.Length 1.4
8 setosa Petal.Width 0.2
9 virginica Sepal.Length 7.7
10 virginica Sepal.Width 3.8
# … with 30 more rows
Now we will demonstrate how to make the counts of Rural and Urban data into wide format from long format. Here is our original data:
# A tibble: 18 x 3
# Groups: year [9]
year rural_urban n
<fct> <chr> <int>
1 2006 Rural 2769
2 2006 Urban 235
3 2007 Rural 2760
4 2007 Urban 238
5 2008 Rural 2753
6 2008 Urban 238
7 2009 Rural 2756
8 2009 Urban 236
9 2010 Rural 2753
10 2010 Urban 238
11 2011 Rural 2756
12 2011 Urban 243
13 2012 Rural 2761
14 2012 Urban 243
15 2013 Rural 2754
16 2013 Urban 246
17 2014 Rural 2753
18 2014 Urban 248
We would like the rural_urban data to be shown in two different columns; one that shows Rural counts and one that shows Urban counts, as this would be easier for people to read. We can use the pivot_wider() function of the tidyr package to do this. This takes two important arguments:
names_from - this argument indicates what variable to use to create the names of the new variables
values_from - this argument indicates what variable to use to fill in the values of the new variables
In our case we will obtain the names from the rural_urban variable and the values from the n variable.
Annual %>%
group_by( year) %>%
count(rural_urban) %>%
tidyr::pivot_wider(names_from = rural_urban,
values_from = n)
# A tibble: 9 x 3
# Groups: year [9]
year Rural Urban
<fct> <int> <int>
1 2006 2769 235
2 2007 2760 238
3 2008 2753 238
4 2009 2756 236
5 2010 2753 238
6 2011 2756 243
7 2012 2761 243
8 2013 2754 246
9 2014 2753 248
Nice!
Now, let’s also create two new variables that show the change in count of rural and urban counties from one year to the next. We can do so using the lag() function of the dplyr package. This function returns the previous value, thus Rural - lag(Rural) will take the current row's value and subtract the previous row's value. Note that it is necessary to include the ungroup() function to stop grouping by year.
Annual %>%
  group_by(year) %>%
  count(rural_urban) %>%
  tidyr::pivot_wider(names_from = rural_urban,
                     values_from = n) %>%
  ungroup() %>%
  mutate("Rural Change" = Rural - lag(Rural),
         "Urban Change" = Urban - lag(Urban))

# A tibble: 9 x 5
year Rural Urban `Rural Change` `Urban Change`
<fct> <int> <int> <int> <int>
1 2006 2769 235 NA NA
2 2007 2760 238 -9 3
3 2008 2753 238 -7 0
4 2009 2756 236 3 -2
5 2010 2753 238 -3 2
6 2011 2756 243 3 5
7 2012 2761 243 5 0
8 2013 2754 246 -7 3
9 2014 2753 248 -1 2
Let’s also add a column about the percent urban for each year. We will use the base round() function to round the percentages to 2 digits after the decimal using the digits = 2 argument. Finally, we also rename the year variable to be Year using the rename() function of the dplyr package, which requires that the new name be listed before the = sign, followed by the old name.
R_U <- Annual %>%
  group_by(year) %>%
  count(rural_urban) %>%
  tidyr::pivot_wider(names_from = rural_urban,
                     values_from = n) %>%
  ungroup() %>%
  mutate("Rural Change" = Rural - lag(Rural),
         "Urban Change" = Urban - lag(Urban),
         "Percent Urban" = round((Urban/(Urban + Rural))*100, digits = 2)) %>%
  rename("Year" = "year")

R_U

# A tibble: 9 x 6
Year Rural Urban `Rural Change` `Urban Change` `Percent Urban`
<fct> <int> <int> <int> <int> <dbl>
1 2006 2769 235 NA NA 7.82
2 2007 2760 238 -9 3 7.94
3 2008 2753 238 -7 0 7.96
4 2009 2756 236 3 -2 7.89
5 2010 2753 238 -3 2 7.96
6 2011 2756 243 3 5 8.1
7 2012 2761 243 5 0 8.09
8 2013 2754 246 -7 3 8.2
9 2014 2753 248 -1 2 8.26
Nice, now we have a fairly easy to interpret table, but we can make it even easier to quickly assess trends in the data using the formattable package. The formattable() function creates a formatted table; it takes a list pairing variables with a stylized version of each variable that specifies the special formatting to add. As a simple example, we will use the color_bar() function of this package to add color bars to the Percent Urban column, which indicates changes in values by the width of a color bar.
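The table below can be produced with a call along these lines. This is a sketch; since R_U is built earlier in the case study, a small stand-in tibble is used here so the example is self-contained:

```r
library(formattable)
library(tibble)

# Stand-in for the R_U table built above; in the case study the call
# would be formattable(R_U, list(`Percent Urban` = color_bar("#FA614B")))
toy <- tibble(Year = c(2006, 2007),
              `Percent Urban` = c(7.82, 7.94))

# color_bar() draws a bar behind each value, scaled to the column
ft <- formattable(toy, list(`Percent Urban` = color_bar("#FA614B")))
```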
| Year | Rural | Urban | Rural Change | Urban Change | Percent Urban |
|---|---|---|---|---|---|
| 2006 | 2769 | 235 | NA | NA | 7.82 |
| 2007 | 2760 | 238 | -9 | 3 | 7.94 |
| 2008 | 2753 | 238 | -7 | 0 | 7.96 |
| 2009 | 2756 | 236 | 3 | -2 | 7.89 |
| 2010 | 2753 | 238 | -3 | 2 | 7.96 |
| 2011 | 2756 | 243 | 3 | 5 | 8.10 |
| 2012 | 2761 | 243 | 5 | 0 | 8.09 |
| 2013 | 2754 | 246 | -7 | 3 | 8.20 |
| 2014 | 2753 | 248 | -1 | 2 | 8.26 |
Nice, now we can see how much the percentage has changed over time.
We can also use the formatter() function to change the color of the Rural Change and Urban Change variables so that if the value is negative it will be red and if it is positive it will be green.
The formatter() function takes an HTML style tag name, which can be any character string (although generally one would use "span"), via the .tag argument, and a style argument where style aspects such as color can be specified using a color argument.
We will create a function called redgreen that specifies that if a value is less than zero it should be red and if it is greater than zero it should be green. To do this we will use the case_when() function like we did previously when creating our rural_urban variable of the county_info object in the Data Wrangling section.
To create a function we use the base function() function. The inputs of the function are contained within the parentheses (), while the steps that should be performed on the input are contained within the curly brackets {}. In this case, our input will be called number. Now when redgreen is used, this will perform the case_when() evaluation on the input provided.
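A sketch of such a function, matching the behavior described above (values of exactly zero or NA fall through to NA, which is an assumption):

```r
library(dplyr)  # for case_when()

# redgreen(): returns "red" for negative values and "green" for
# positive values; anything else (0 or NA) is left as NA
redgreen <- function(number) {
  case_when(number < 0 ~ "red",
            number > 0 ~ "green")
}

redgreen(c(3, NA, -2))
#> [1] "green" NA "red"
```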
We can see that this function does indeed change numeric values to be the color names red and green.
[1] "green" NA "red"
Now this function can be used to replace the color values for the Rural Change and Urban Change variables.
formattable(R_U, list(`Percent Urban` = color_bar("#FA614B"),
                      `Rural Change` = formatter(.tag = "span",
                                                 style = ~style(
                                                   color = redgreen(number = `Rural Change`))),
                      `Urban Change` = formatter(.tag = "span",
                                                 style = ~style(
                                                   color = redgreen(number = `Urban Change`)))))

| Year | Rural | Urban | Rural Change | Urban Change | Percent Urban |
|---|---|---|---|---|---|
| 2006 | 2769 | 235 | NA | NA | 7.82 |
| 2007 | 2760 | 238 | -9 | 3 | 7.94 |
| 2008 | 2753 | 238 | -7 | 0 | 7.96 |
| 2009 | 2756 | 236 | 3 | -2 | 7.89 |
| 2010 | 2753 | 238 | -3 | 2 | 7.96 |
| 2011 | 2756 | 243 | 3 | 5 | 8.10 |
| 2012 | 2761 | 243 | 5 | 0 | 8.09 |
| 2013 | 2754 | 246 | -7 | 3 | 8.20 |
| 2014 | 2753 | 248 | -1 | 2 | 8.26 |
Now let's look at how the percentage of counties that are urban varies by state over time.

R_U <- Annual %>%
  group_by(BUYER_STATE, year) %>%
  count(rural_urban)
R_U %<>% pivot_wider(names_from = rural_urban, values_from = n)
R_U %<>% mutate(Urban = replace_na(Urban, 0))
R_U %<>% mutate(percent_urban = (Urban/(Urban + Rural))*100)
ggplot(R_U, aes(x = BUYER_STATE, y = percent_urban, color = year)) +
  geom_point() +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90),
        legend.position = "bottom") +
  coord_flip() +
  scale_colour_viridis_d(option = "magma", end = 0, begin = 1, guide = guide_legend(nrow = 1)) +
  labs(x = "Percent Urban",
       title = "State differences in percentage of counties that are urban over time")

Now let's get a sense of how the shipments of oxycodone and hydrocodone changed over time in general across all counties in the US.
Annual %>%
  group_by(year) %>%
  ggplot(aes(x = year, y = (DOSAGE_UNIT/1000000))) +
  geom_boxplot() +
  labs(title = "Average Number of Opioid Pills Shipped to a US County",
       y = "Number of pills in millions") +
  theme_minimal()

Annual %>%
  group_by(year) %>%
  summarize(mean = mean(DOSAGE_UNIT)) %>%
  ggplot(aes(x = year, y = (mean/1000000))) +
  geom_point() +
  labs(title = "Average Number of Opioid Pills Shipped to a US County",
       y = "Number of pills in millions") +
  theme_minimal()

It looks like the average number of opioid pills shipped to a county peaked in 2011 and has slowly declined since.
We can look a bit deeper if we calculate a mean for each state.
Annual %>%
  group_by(BUYER_STATE, year) %>%
  summarise(mean_DOSAGE = mean(DOSAGE_UNIT)) %>%
  ungroup() %>%
  ggplot(aes(x = year, y = (mean_DOSAGE/1000000))) +
  geom_boxjitter() +
  labs(title = "Average number of opioid pills shipped to a given county for each state",
       y = "Number of pills in millions") +
  theme_minimal()

Again we see the same general overall trend. We also see that the spread was quite large, with some states receiving many more pills.
To get a better sense of how each state changed over time we can create a line plot instead.
Annual %>%
  group_by(BUYER_STATE, year) %>%
  summarise(mean_DOSAGE = mean(DOSAGE_UNIT)) %>%
  ungroup() %>%
  ggplot(aes(x = year,
             y = mean_DOSAGE,
             group = BUYER_STATE,
             color = BUYER_STATE)) +
  geom_line()

Since we have so many states, the legend is not very useful. Instead we can use the ggiraph package to create an interactive plot that will tell people what state each line represents when they hover over different data points.
g <- Annual %>%
  group_by(BUYER_STATE, year) %>%
  summarise(mean_DOSAGE = mean(DOSAGE_UNIT)) %>%
  ungroup() %>%
  ggplot(aes(x = year, y = mean_DOSAGE, group = BUYER_STATE, color = BUYER_STATE)) +
  geom_line()

g <- g + geom_point_interactive(aes(
    color = BUYER_STATE,
    tooltip = usdata::abbr2state(BUYER_STATE)),
  size = 2,
  alpha = 3/10) +
  theme(legend.position = "none")

girafe(code = print(g))

In this plot it appears that the largest number of pills were shipped to counties in California. However, since we did not account for population or population density, this could simply be because it is the most populated state. To account for this we will perform something called normalization to make a fairer comparison.
Taking a glimpse() of the Annual data, now including county population and land-area information:

Rows: 26,980
Columns: 17
$ BUYER_COUNTY <chr> "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABB…
$ BUYER_STATE <chr> "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "L…
$ year <fct> 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 20…
$ count <int> 877, 908, 871, 930, 1197, 1327, 1509, 1572, 1558, 5802, …
$ DOSAGE_UNIT <dbl> 363620, 402940, 424590, 467230, 539280, 566560, 589010, …
$ countyfips <fct> 45001, 45001, 45001, 45001, 45001, 45001, 45001, 45001, …
$ STATE <int> 45, 45, 45, 45, 45, 45, 45, 45, 45, 22, 22, 22, 22, 22, …
$ COUNTY <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ county_name <chr> "Abbeville", "Abbeville", "Abbeville", "Abbeville", "Abb…
$ NAME <chr> "Abbeville County, South Carolina", "Abbeville County, S…
$ variable <chr> "B01003_001", "B01003_001", "B01003_001", "B01003_001", …
$ population <int> 25821, 25745, 25699, 25347, 25643, 25515, 25387, 25233, …
$ Areaname <chr> "Abbeville, SC", "Abbeville, SC", "Abbeville, SC", "Abbe…
$ LND110210D <dbl> 490.48, 490.48, 490.48, 490.48, 490.48, 490.48, 490.48, …
$ density <dbl> 52.64435, 52.48940, 52.39561, 51.67795, 52.28144, 52.020…
$ rural_urban <chr> "Rural", "Rural", "Rural", "Rural", "Rural", "Rural", "R…
$ pop_DOSAGE <dbl> 14.08234, 15.65119, 16.52165, 18.43335, 21.03030, 22.204…
To see how normalization works, consider a simple example:

example_data <- tibble(population = c(10, 50, 100),
                       pills = c(100, 100, 100))
example_data %<>% mutate(norm_pills = pills/population)
example_data

# A tibble: 3 x 3
population pills norm_pills
<dbl> <dbl> <dbl>
1 10 100 10
2 50 100 2
3 100 100 1
You can see that on average 10 pills were shipped for each person in the first row. In the second row, the population is much larger, thus despite the same number of pills being shipped to this example county, there were only enough pills for an average of 2 per person. In the final row the population is very large, thus only enough pills were shipped to give on average 1 per person. Note, however, that it is likely that only a small portion of each county's population actually received the pills that were shipped there, but this helps us get a sense of the relative amount shipped to each county and likely used by people in the county where the pills were shipped (although this is also not certain).
Annual %>%
  group_by(year) %>%
  summarize(mean = mean(DOSAGE_UNIT/population)) %>%
  ggplot(aes(x = year, y = mean)) +
  geom_col() +
  labs(title = "Average Number of pills shipped per person for a given county",
       y = "Normalized Number of pills") +
  theme_minimal()

Let's see how this changes the state-specific data.
g <- Annual %>%
  group_by(BUYER_STATE, year) %>%
  summarise(mean_DOSAGE = mean(DOSAGE_UNIT/population)) %>%
  ungroup() %>%
  ggplot(aes(x = year, y = mean_DOSAGE, group = BUYER_STATE, color = BUYER_STATE)) +
  geom_line()

g <- g + geom_point_interactive(aes(
    color = BUYER_STATE,
    tooltip = usdata::abbr2state(BUYER_STATE)),
  size = 2,
  alpha = 3/10) +
  theme(legend.position = "none")

girafe(code = print(g))

This dramatically changed the resulting plot!
We can see that now Tennessee, Kentucky, and West Virginia were among the top states to receive pills relative to their populations.
p1 <- ggplot(Annual, aes(x = year, y = DOSAGE_UNIT, colour = year)) +
  geom_boxplot() +
  labs(title = "Without Normalization") +
  theme_minimal()

p2 <- ggplot(Annual, aes(x = year, y = DOSAGE_UNIT/population, colour = year)) +
  geom_boxplot() +
  labs(title = "Population Normalization") +
  theme_minimal()

ggarrange(p1, p2, nrow = 1, ncol = 2)

Annual %<>% mutate(pop_DOSAGE = DOSAGE_UNIT/population)
Annual %<>% mutate(in_millions_DOSAGE = DOSAGE_UNIT/1000000)
Annual %>%
  tidyr::pivot_longer(names_to = "type",
                      values_to = "value",
                      cols = c(in_millions_DOSAGE,
                               pop_DOSAGE)) %>%
  mutate(type = forcats::fct_inorder(type)) %>%
  glimpse()

Rows: 53,960
Columns: 18
$ BUYER_COUNTY <chr> "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABB…
$ BUYER_STATE <chr> "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "S…
$ year <fct> 2006, 2006, 2007, 2007, 2008, 2008, 2009, 2009, 2010, 20…
$ count <int> 877, 877, 908, 908, 871, 871, 930, 930, 1197, 1197, 1327…
$ DOSAGE_UNIT <dbl> 363620, 363620, 402940, 402940, 424590, 424590, 467230, …
$ countyfips <fct> 45001, 45001, 45001, 45001, 45001, 45001, 45001, 45001, …
$ STATE <int> 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, …
$ COUNTY <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ county_name <chr> "Abbeville", "Abbeville", "Abbeville", "Abbeville", "Abb…
$ NAME <chr> "Abbeville County, South Carolina", "Abbeville County, S…
$ variable <chr> "B01003_001", "B01003_001", "B01003_001", "B01003_001", …
$ population <int> 25821, 25821, 25745, 25745, 25699, 25699, 25347, 25347, …
$ Areaname <chr> "Abbeville, SC", "Abbeville, SC", "Abbeville, SC", "Abbe…
$ LND110210D <dbl> 490.48, 490.48, 490.48, 490.48, 490.48, 490.48, 490.48, …
$ density <dbl> 52.64435, 52.64435, 52.48940, 52.48940, 52.39561, 52.395…
$ rural_urban <chr> "Rural", "Rural", "Rural", "Rural", "Rural", "Rural", "R…
$ type <fct> in_millions_DOSAGE, pop_DOSAGE, in_millions_DOSAGE, pop_…
$ value <dbl> 0.36362, 14.08234, 0.40294, 15.65119, 0.42459, 16.52165,…
Annual %>%
  pivot_longer(names_to = "type",
               values_to = "value",
               cols = c(in_millions_DOSAGE,
                        pop_DOSAGE)) %>%
  mutate(type = forcats::fct_inorder(type)) %>%
  ggplot(aes(x = year, y = value, colour = year)) +
  geom_boxplot() +
  facet_wrap(~type, scale = "free") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90),
        legend.position = "none")

Annual %>%
  pivot_longer(names_to = "type",
               values_to = "value",
               cols = c(DOSAGE_UNIT,
                        pop_DOSAGE,
                        dens_DOSAGE)) %>%
  mutate(type = forcats::fct_inorder(type)) %>%
  ggplot(aes(x = year, y = log(value), colour = year)) +
  geom_boxplot() +
  facet_wrap(~type, scale = "free") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90),
        legend.position = "none")

The goal of normalization is to make every datapoint have the same scale so that each feature is equally important.
Annual %>%
  pivot_longer(names_to = "type",
               values_to = "value",
               cols = c(DOSAGE_UNIT,
                        pop_DOSAGE)) %>%
  mutate(type = forcats::fct_inorder(type)) %>%
  ggplot(aes(y = value, x = year, colour = rural_urban, group = rural_urban)) +
  stat_summary(fun.y = mean,
               fun.ymin = function(x) mean(x) - sd(x),
               fun.ymax = function(x) mean(x) + sd(x),
               geom = "pointrange") +
  stat_summary(fun.y = mean,
               geom = "line") +
  facet_wrap(~ type, scales = "free") +
  labs(title = "Dosage Change Without Normalization") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90))

Recall that the article that surveyed heroin users in the Survey of Key Informants' Patients (SKIP) Program and the Researchers and Participants Interacting Directly (RAPID) program found that:
A much greater percentage of heroin users completing the survey in the SKIP Program reported currently living in small urban or nonurban areas than in large urban areas (75.2% vs 24.8%) at the time of survey completion.
This survey used self-declared area of current residence (large urban, small urban, suburban, or rural).
According to the Organization for Economic Co-operation and Development (OECD):
Urban population by city size is determined by population density and commuting patterns; this better reflects the economic function of cities in addition to their administrative boundaries. Urban areas in OECD countries are classified as: large metropolitan areas if they have a population of 1.5 million or more; metropolitan areas if their population is between 500 000 and 1.5 million; medium-size urban areas if their population is between 200 000 and 500 000; and, small urban areas if their population is between 50 000 and 200 000. This indicator is measured as a percentage of the national population.
Thus the small urban cutoff is populations less than 200,000.
We will now use this to see how this changes our results.
Annual %<>%
  mutate(large_urban = case_when(population >= 200000 ~ "Large Urban",
                                 population >= 50000 &
                                   population < 200000 ~ "Small Urban or Rural",
                                 population < 50000 ~ "Small Urban or Rural"))

Annual %>%
  pivot_longer(names_to = "type",
               values_to = "value",
               cols = c(DOSAGE_UNIT,
                        pop_DOSAGE)) %>%
  mutate(type = forcats::fct_inorder(type)) %>%
  ggplot(aes(y = value, x = year, colour = large_urban, group = large_urban)) +
  stat_summary(fun.y = mean,
               fun.ymin = function(x) mean(x) - sd(x),
               fun.ymax = function(x) mean(x) + sd(x),
               geom = "pointrange") +
  stat_summary(fun.y = mean,
               geom = "line") +
  facet_wrap(~ type, scales = "free") +
  labs(title = "Dosage Change Without Normalization") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90))

Annual %>%
  pivot_longer(names_to = "type",
               values_to = "value",
               cols = c(DOSAGE_UNIT,
                        pop_DOSAGE)) %>%
  mutate(type = forcats::fct_inorder(type)) %>%
  ggplot(aes(y = value, x = year, colour = large_urban)) +
  geom_boxplot() +
  facet_wrap(~ type, scales = "free") +
  labs(title = "Dosage Change Without Normalization") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90))

For more on unpaired two-sample t-tests in R, see: http://www.sthda.com/english/wiki/unpaired-two-samples-t-test-in-r
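The Welch two-sample t-test output below can be generated with t.test()'s formula interface; in the case study the calls would be along the lines of t.test(DOSAGE_UNIT ~ rural_urban, data = Annual), and similarly for large_urban and pop_DOSAGE. A self-contained sketch on toy data:

```r
# t.test() with a formula and a two-level grouping variable runs a
# Welch two-sample t-test by default (var.equal = FALSE)
toy <- data.frame(value = c(1, 2, 3, 10, 11, 12),
                  group = rep(c("Rural", "Urban"), each = 3))
res <- t.test(value ~ group, data = toy)
res$method  # "Welch Two Sample t-test"
```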
Welch Two Sample t-test
data: DOSAGE_UNIT by rural_urban
t = -36.714, df = 2183.1, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-18391627 -16526480
sample estimates:
mean in group Rural mean in group Urban
2235774 19694828
Welch Two Sample t-test
data: DOSAGE_UNIT by large_urban
t = 47.154, df = 2748.6, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
18481629 20085375
sample estimates:
mean in group Large Urban mean in group Small Urban or Rural
20959044 1675541
Welch Two Sample t-test
data: pop_DOSAGE by rural_urban
t = -0.12109, df = 2408.2, p-value = 0.9036
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.370206 1.210825
sample estimates:
mean in group Rural mean in group Urban
39.52397 39.60366
Welch Two Sample t-test
data: pop_DOSAGE by large_urban
t = -15.749, df = 5036.4, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.516278 -4.294946
sample estimates:
mean in group Large Urban mean in group Small Urban or Rural
35.12367 40.02928
We can see that the result is different depending on how we define the data and how we normalize the data!
RStudio
Cheatsheet on RStudio IDE
Other RStudio cheatsheets
RStudio projects
String manipulation cheatsheet
Table formats
Geocoding
Coordinate reference system (CRS) EPSG World Geodetic System (WGS) version 84, also called EPSG:4326
Albers equal-area conic projection
crs 102008
To learn more about geospatial coordinate systems see here and here.
ggplot2 package
Please see this case study for more details on using ggplot2
grammar of graphics
ggplot2 themes
The RStudio cheatsheet for R Markdown and this tutorial are great for getting started.
Packages used in this case study:
| Package | Use in this case study |
|---|---|
| readxl | to import an excel file |
| httr | to retrieve data from an API |
| tibble | to create tibbles (the tidyverse version of dataframes) |
| jsonlite | to parse json files |
| stringr | to manipulate character strings within the data (subset and detect parts of strings) |
| dplyr | to filter, subset, join, modify, and summarize the data |
| magrittr | to pipe sequential commands |
| tidyr | to change the shape or format of tibbles to wide and long |
| naniar | to get a sense of missing data |
| ggplot2 | to create plots |
| forcats | to reorder factor for plot |
| ggpol | to create plots that have jitter and half boxplots |
| ggiraph | to create interactive plots |
| formattable | to create a formatted table |
If you or a loved one is struggling with opioid addiction, contact the SAMHSA’s National Helpline at 1-800-662-HELP (4357).
It is a free, confidential, 24/7, 365-day-a-year treatment referral and information service (in English and Spanish) for individuals and families facing mental and/or substance use disorders.
You can also contact the Addiction Center at (877)871-3575 which also has a confidential 24/7 live chat at: https://www.addictioncenter.com/drugs/overdose/.
According to their website:
Remember, that being able to treat an overdose at home is not a replacement for a hospital. Even if the moment has passed, and the victim seems fine, there is still a chance that something is going on that cannot be seen by the human eye. Taking the victim to the hospital, can mean the difference between life and death.
Overdose is a scary word. We often associate it with death, but the two are not always connected. Life can go on after an overdose, but only if the person suffering understands and learns from it. Getting on the road to recovery is not easily done but it is always possible, and the only guaranteed way to never suffer an overdose again. If you don’t know where this path begins, or need help getting help for a loved one, please reach out to a dedicated treatment provider. They’re here, 24/7, to answer any questions you may have. Be it for yourself or someone else.
According to harmreduction.org, the following are signs of an overdose:
If someone is making unfamiliar sounds while “sleeping” it is worth trying to wake him or her up. Many loved ones of users think a person was snoring, when in fact the person was overdosing. These situations are a missed opportunity to intervene and save a life.
Sometimes it can be difficult to tell if a person is just very high, or experiencing an overdose. If you’re having a hard time telling the difference, it is best to treat the situation like an overdose – it could save someone’s life.
The most important thing is to act right away!
R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gdtools_0.2.2 formattable_0.2.0.1 ggpubr_0.4.0
[4] ggiraph_0.7.8 ggpol_0.0.6 forcats_0.5.0
[7] ggplot2_3.3.1 naniar_0.5.1 tidyr_1.1.0
[10] dplyr_1.0.0 magrittr_1.5 stringr_1.4.0
[13] jsonlite_1.7.1 httr_1.4.2 tibble_3.0.1
[16] readxl_1.3.1 readr_1.3.1 knitr_1.28
[19] here_0.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 lattice_0.20-41 assertthat_0.2.1 rprojroot_1.3-2
[5] digest_0.6.25 utf8_1.1.4 R6_2.4.1 cellranger_1.1.0
[9] plyr_1.8.6 backports_1.1.7 visdat_0.5.3 evaluate_0.14
[13] pillar_1.4.4 rlang_0.4.6 curl_4.3 uuid_0.1-4
[17] data.table_1.12.8 car_3.0-8 Matrix_1.2-18 rmarkdown_2.2
[21] splines_4.0.1 labeling_0.3 foreign_0.8-80 htmlwidgets_1.5.1
[25] munsell_0.5.0 broom_0.5.6 compiler_4.0.1 xfun_0.14
[29] pkgconfig_2.0.3 systemfonts_0.2.2 base64enc_0.1-3 mgcv_1.8-31
[33] htmltools_0.4.0 tidyselect_1.1.0 rio_0.5.16 viridisLite_0.3.0
[37] fansi_0.4.1 crayon_1.3.4 withr_2.2.0 grid_4.0.1
[41] nlme_3.1-148 gtable_0.3.0 lifecycle_0.2.0 scales_1.1.1
[45] zip_2.0.4 cli_2.0.2 stringi_1.4.6 carData_3.0-4
[49] farver_2.0.3 ggsignif_0.6.0 ellipsis_0.3.1 generics_0.0.2
[53] vctrs_0.3.1 cowplot_1.0.0 openxlsx_4.1.5 tools_4.0.1
[57] glue_1.4.1 purrr_0.3.4 hms_0.5.3 abind_1.4-5
[61] yaml_2.2.1 usdata_0.1.0 colorspace_1.4-1 rstatix_0.6.0
[65] haven_2.3.1
We would like to acknowledge Elizabeth Stuart for assisting in framing the major direction of the case study.
We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.